Integrating analytics with relational databases

Author: Raasveldt M. (Mark)
Publication date: 27/08/2018

In order to uncover insights and trends, it is an increasingly common practice for companies of all shapes and sizes to gather large quantities of data and to then analyze that data. This data can come from a multitude of different sources, ranging from data gathered about consumer behavior to data gathered from sensors. The most prevalent way of storing and managing data has traditionally been a relational database management system (RDBMS). However, there is currently a disconnect between the tools used for analysis of data and the tools used for storing that data. Instead of working directly with RDBMSes, these tools are built to work in a stand-alone fashion, and offer integration with RDBMSes as an afterthought. The focus of my PhD research is on investigating different methods of combining popular analytical tools (such as R or Python) with database management systems in an efficient and user-friendly fashion.

CWI's Institutional Repository

Integrating analytics with relational databases

Author: Raasveldt M. (Mark)
Publication date: 09/06/2020

The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing. However, these powerful systems have gone largely unused by analysts and data scientists. This poor adoption is caused primarily by the state of database-client integration. In this thesis we attempt to overcome this challenge by investigating how we can facilitate efficient and painless integration of analytical tools and relational database management systems. We focus our investigation on the three primary methods for database-client integration: client-server connections, in-database processing and embedding the database inside the client application.

CWI's Institutional Repository

Leiden University Scholarly Publications

Vectorized UDFs in column-stores

Author: Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 18/12/2016

CWI's Institutional Repository

Data Management for Data Science - Towards Embedded Analytics

Author: Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 12/01/2020

The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system.

CWI's Institutional Repository

Don't hold my data hostage - A case for client protocol redesign

Author: Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 28/08/2017

Transferring a large amount of data from a database to a client program is a surprisingly expensive operation. The time this requires can easily dominate the query execution time for large result sets. This represents a significant hurdle for external data analysis, for example when using statistical software. In this paper, we explore and analyse the result set serialization design space. We present experimental results from a large chunk of the database market and show the inefficiencies of current approaches. We then propose a columnar serialization method that improves transmission performance by an order of magnitude.
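
As a rough illustration of the difference between row-wise and columnar result set layouts mentioned in this abstract, the sketch below serializes the same small result set both ways. The function names, the two-column schema, and the fixed-width little-endian layout are assumptions made for illustration only; this is not the protocol proposed in the paper.

```python
import struct

# Hypothetical result set with schema (id: int32, value: float64).
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

def serialize_rowwise(rows):
    """Interleave the fields of each row; one pack call per row."""
    return b"".join(struct.pack("<id", i, v) for i, v in rows)

def serialize_columnar(rows):
    """Write the row count, then each column as one contiguous block."""
    n = len(rows)
    ids = struct.pack(f"<{n}i", *(r[0] for r in rows))
    values = struct.pack(f"<{n}d", *(r[1] for r in rows))
    return struct.pack("<i", n) + ids + values

def deserialize_columnar(buf):
    """Read the row count, then unpack each whole column in a single call."""
    (n,) = struct.unpack_from("<i", buf, 0)
    ids = struct.unpack_from(f"<{n}i", buf, 4)
    values = struct.unpack_from(f"<{n}d", buf, 4 + 4 * n)
    return list(ids), list(values)

if __name__ == "__main__":
    payload = serialize_columnar(rows)
    print(deserialize_columnar(payload))
```

In the columnar form a client can decode an entire column with one call instead of one call per row, which is one plausible reason such layouts are cheaper to process for large result sets.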

CWI's Institutional Repository

Integrating analytics with relational databases

Author: Raasveldt M.
Publication date: 09/06/2020

Leiden University Scholarly Publications

Deep integration of machine learning into column stores

Author: Holanda P.T. (Pedro)
Manegold S. (Stefan)
Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 01/01/2018

We leverage vectorized User-Defined Functions (UDFs) to efficiently integrate unchanged machine learning pipelines into an analytical data management system. The entire pipelines, including data, models, parameters and evaluation outcomes, are stored and executed inside the database system. Experiments using our MonetDB/Python UDFs show greatly improved performance due to reduced data movement and parallel processing opportunities. In addition, this integration enables meta-analysis of models using relational queries.

CWI's Institutional Repository

Leiden University Scholarly Publications

Optimizing group-by and aggregation using GPU-CPU co-processing

Author: Boncz P.A. (Peter)
Gomes Tomé D. (Diego)
Gubner T.K. (Tim)
Raasveldt M. (Mark)
Rozenberg E. (Eyal)
Publication date: 27/08/2018

While GPU query processing is a well-studied area, real adoption is limited in practice, as GPU execution is typically only significantly faster than CPU execution if the data resides in GPU memory, which limits scalability to small data scenarios where performance tends to be less critical. Another problem is that not all query code (e.g. UDFs) will realistically be able to run on GPUs. We therefore investigate CPU-GPU co-processing, where both the CPU and GPU are involved in evaluating the query in scenarios where the data does not fit in the GPU memory. As we wish to deeply explore opportunities for optimizing execution speed, we narrow our focus further to a specific well-studied OLAP scenario, amenable to such co-processing, in the form of the TPC-H benchmark Query 1. For this query, and at large scale factors, we are able to improve performance significantly over the state-of-the-art for GPU implementations; we present competitive performance of a GPU versus a state-of-the-art multi-core CPU baseline, a novelty for data exceeding GPU memory size; and finally, we show that co-processing does provide significant additional speedup over any of the processors individually. We achieve this performance improvement by utilizing parallelism-friendly compression to alleviate the PCIe transfer bottleneck, query-compilation-like fusion of the processing operations, and a simple yet effective scheduling mechanism. We hope that some of these features can inspire future work on GPU-focused and heterogeneous analytic DBMSes.

CWI's Institutional Repository

DuckDB

Author: Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 05/12/2018

CWI's Institutional Repository